The previous two articles — [Big Data Computing: Batch Processing](https://xx/Big Data Computing:Batch Processing) and [Big Data Computing: Real-Time Processing](https://xx/Big Data Computing:Real-Time Processing) — introduced the principles, architectures, frameworks, application scenarios, and limitations of batch and real-time computing. In [Big Data Computing: Batch Processing vs. Real-Time Computing](https://xx/Big Data Computing:Batch Processing vs. Real-Time Computing), we compared the two approaches across multiple dimensions to understand their characteristics, limitations, and use cases.
This article explores how batch and real-time processing — originally two parallel development paths — have gradually converged into stream-batch unification due to evolving business needs and technological progress. We’ll analyze why unification became necessary, its core principles, and its architecture.
Why Stream-Batch Unification?
Limitations of Traditional Architectures
In traditional big data platforms, batch and real-time processing followed parallel tracks:
- Batch processing excels at bulk analysis of massive historical datasets, at the cost of high latency. Common frameworks: Hadoop, Spark, Hive.
- Real-time processing handles continuous data streams with millisecond- to second-level latency, but is ill-suited to analyzing full historical datasets. Common technologies: Flink, Kafka, Paimon.
This separation introduces several problems:
- Duplicate business logic: The same business requirement often needs two separate implementations — one for batch and one for streaming.
- Data inconsistency: Batch and stream pipelines take different data paths, so their results can diverge.
- High cost: Two codebases must be developed, tested, and maintained, often by separate teams.
- Low resource utilization: Separate frameworks occupy independent clusters, each requiring reserved capacity for peak loads.
Business Drivers
As businesses increasingly demand both real-time responsiveness and deep historical analysis, maintaining two separate systems becomes ever more costly. This pressure drove the exploration of unifying batch and stream processing.
What Is Stream-Batch Unification?
Stream-batch unification means using a single computation engine and programming model to support both batch and real-time workloads while ensuring consistent results. Its key characteristics are:
- Unified engine: Internally adapts to batch or streaming modes.
- Unified API: Developers write code once and run it in either mode (see the sketch after this list).
- Unified data path: Supports bounded and unbounded datasets from the same source.
- Unified resources: Shared compute resources improve utilization.
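To make the "write once, run in either mode" idea concrete, here is a minimal sketch using Flink's DataStream API. The word-count logic and element values are illustrative, not from any particular production setup: the same pipeline becomes a batch or a streaming job by flipping a single runtime-mode switch.

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UnifiedWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // The only mode-specific line: BATCH for bounded sources,
        // STREAMING for unbounded ones. The pipeline below is untouched.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        env.fromElements("stream", "batch", "stream") // bounded demo source
           .map(word -> Tuple2.of(word, 1))
           .returns(Types.TUPLE(Types.STRING, Types.INT))
           .keyBy(t -> t.f0)
           .sum(1)
           .print();

        env.execute("unified-word-count");
    }
}
```

Pointing the same pipeline at an unbounded source (such as a Kafka topic) and switching to `RuntimeExecutionMode.STREAMING` turns it into a continuously running job without touching the business logic.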
Architecture of Stream-Batch Unification
A typical unified architecture consists of four layers: data sources, compute engine, storage, and applications.
- Data sources: Kafka, Pulsar, or socket streams continuously feed data into the pipeline.
- Compute engine: Unified APIs (e.g., Flink SQL / Table API) express the business logic once; execution plans adapt to bounded (batch) or unbounded (stream) inputs. A sketch follows this list.
- Storage: Results written to HDFS, data lakes, Kafka, or NoSQL stores, depending on use case.
- Applications: BI dashboards, data warehouse queries, monitoring/alerting, recommendation, and risk control.
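As a hedged illustration of how these layers meet in code, the sketch below wires a Kafka source table, one Flink SQL statement of business logic, and a sink through Flink's TableEnvironment. The topic name, server address, and schema are placeholders, and the print connector stands in for a real storage layer such as HDFS, a data lake, or a NoSQL store.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class UnifiedPipelineSketch {
    public static void main(String[] args) {
        // Streaming mode here; EnvironmentSettings.inBatchMode() would reuse
        // the exact same SQL against a bounded source.
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Source layer: order events from Kafka (names are placeholders).
        tEnv.executeSql(
                "CREATE TABLE orders (" +
                "  order_id STRING," +
                "  amount   DOUBLE" +
                ") WITH (" +
                "  'connector' = 'kafka'," +
                "  'topic' = 'orders'," +
                "  'properties.bootstrap.servers' = 'localhost:9092'," +
                "  'scan.startup.mode' = 'earliest-offset'," +
                "  'format' = 'json'" +
                ")");

        // Storage layer: the print connector stands in for HDFS / lake / NoSQL.
        tEnv.executeSql(
                "CREATE TABLE order_totals (" +
                "  order_id STRING," +
                "  total    DOUBLE" +
                ") WITH ('connector' = 'print')");

        // Compute layer: the business logic, written once in Flink SQL.
        tEnv.executeSql(
                "INSERT INTO order_totals " +
                "SELECT order_id, SUM(amount) AS total " +
                "FROM orders GROUP BY order_id");
    }
}
```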
Technology Stack
1. Messaging Layer
- Role: Data ingestion, buffering, and distribution.
- Tech: Kafka, Pulsar.
2. Compute Engine
- Role: Provide unified APIs and execution plans.
- Tech: Flink.
3. Data Lake
- Role: Unified storage formats with stream-batch read and write support, ensuring consistent data visibility and transactional integrity (see the catalog sketch after this list).
- Tech: Hudi, Delta Lake, Paimon.
4. Query Engines
- Role: Low-latency query responses.
- Tech: Redis, ClickHouse, Doris.
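To show how the data-lake layer bridges both modes, here is a hedged sketch that registers an Apache Paimon catalog in Flink SQL. The warehouse path and table schema are placeholders, and it assumes the Paimon Flink connector jar is on the classpath; a streaming job can continuously write the table while a batch job reads a consistent snapshot of the same data.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class LakeTableSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Register a Paimon catalog (warehouse path is a placeholder).
        tEnv.executeSql(
                "CREATE CATALOG lake WITH (" +
                "  'type' = 'paimon'," +
                "  'warehouse' = 'file:///tmp/paimon'" +
                ")");
        tEnv.executeSql("USE CATALOG lake");

        // One table, two access patterns: streaming writers append changes,
        // batch readers see consistent snapshots of the same data.
        tEnv.executeSql(
                "CREATE TABLE IF NOT EXISTS user_events (" +
                "  user_id STRING," +
                "  action  STRING," +
                "  ts      TIMESTAMP(3)," +
                "  PRIMARY KEY (user_id, ts) NOT ENFORCED" +
                ")");
    }
}
```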
Despite being relatively new, the ecosystem for stream-batch unification is already rich, with multiple technical options tailored for different application scenarios. Typical examples include:
1. Real-Time Data Warehouse
E-commerce and social platforms rely heavily on real-time data warehouses for second-level metrics and long-term trend analysis. A common setup includes:
- Kafka for ingesting business event streams.
- Flink for real-time aggregation and computation.
- Results stored in ClickHouse or Doris for sub-second querying.
- Offline jobs ingest the same data into Hudi for deep historical analysis.
- Unified Flink SQL keeps the logic consistent across batch and streaming (sketched below).
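A hedged sketch of the streaming half of this setup, computing a per-minute sales metric: the topic, schema, and field names are illustrative, and the print sink stands in for a ClickHouse or Doris table (whose connector options vary by connector version).

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class RealtimeMetricsSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Business event stream from Kafka, with a watermark for event time.
        tEnv.executeSql(
                "CREATE TABLE pay_events (" +
                "  item_id STRING," +
                "  amount  DOUBLE," +
                "  ts      TIMESTAMP(3)," +
                "  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND" +
                ") WITH (" +
                "  'connector' = 'kafka'," +
                "  'topic' = 'pay_events'," +
                "  'properties.bootstrap.servers' = 'localhost:9092'," +
                "  'format' = 'json'" +
                ")");

        // Print stands in for a ClickHouse/Doris sink table.
        tEnv.executeSql(
                "CREATE TABLE minute_gmv (" +
                "  window_start TIMESTAMP(3)," +
                "  item_id STRING," +
                "  gmv DOUBLE" +
                ") WITH ('connector' = 'print')");

        // Per-minute sales via a tumbling window; the same statement can run
        // in batch mode over a bounded source for backfills.
        tEnv.executeSql(
                "INSERT INTO minute_gmv " +
                "SELECT window_start, item_id, SUM(amount) AS gmv " +
                "FROM TABLE(TUMBLE(TABLE pay_events, DESCRIPTOR(ts), INTERVAL '1' MINUTE)) " +
                "GROUP BY window_start, window_end, item_id");
    }
}
```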
2. Real-Time Financial Risk Control
Financial platforms often need to monitor transactions in real time, intercept suspicious activity, and maintain historical data for retrospective analysis. A typical setup:
- Kafka ingests risk-control event streams.
- Flink performs real-time rule-based monitoring.
- Daily offline jobs train risk models on historical data.
- Both the batch and streaming tasks share the same Flink Table API logic (see the sketch below).
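To illustrate the shared-logic point, a hedged sketch: one rule written once against the Table API, callable from both a streaming monitor and a batch backtest. The table name, fields, and the 10,000 threshold are invented for illustration.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import static org.apache.flink.table.api.Expressions.$;
import static org.apache.flink.table.api.Expressions.lit;

public class RiskRuleSketch {
    // The rule is defined once; streaming and batch jobs both call it.
    static Table flagLargeTransfers(Table transactions) {
        return transactions
                .filter($("amount").isGreater(lit(10_000))) // illustrative threshold
                .select($("txn_id"), $("user_id"), $("amount"));
    }

    public static void main(String[] args) {
        // Streaming monitor: evaluates the rule continuously on live events.
        TableEnvironment streaming =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());
        // Batch backtest: replays the same rule over historical data.
        TableEnvironment batch =
                TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // Each environment would register its own 'transactions' source
        // (Kafka for streaming, a lake table for batch), then run:
        // flagLargeTransfers(env.from("transactions")).executeInsert("alerts");
    }
}
```

Because the rule is an ordinary Java method over a `Table`, it can be unit-tested once and deployed to both pipelines, which is precisely the duplication problem the separated architecture could not avoid.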
Conclusion
The rise of real-time computing highlighted the value of fresh data, while batch processing underscored the importance of historical insights. This convergence has driven technology away from two parallel paths toward stream-batch unification — not just as a technical combination, but as an architectural paradigm shift.